2024-04-11
The tidymodels framework
Using tidymodels tools to develop a linear regression model
Then, an example of a Bayesian linear model fit without tidymodels (see Course Notes chapter 33).
Generally, regression allows us to summarize how predictions (or average values) of an outcome vary across individuals defined by a set of predictors.
Some of the most important uses of regression include:
If you’re struggling with this, or if your research question isn’t in the form of a question, consider these approaches.
If you’re doing something else, I still need to see that you meet standards (1) and (3), at least.
For linear models, we have:
- lm to fit models for quantitative outcomes, compute and plot predictions and residuals, obtain confidence intervals, etc.
- ols from the rms package to save and explore additional components of the model’s fit and to (slightly) expand the capacity of lm fits to incorporate non-linear terms and multiple imputation.

For logistic models, we have:

- glm to fit models for binary outcomes, compute and plot predictions, hypothesis tests and confidence intervals
- lrm from rms to save and explore additional components of the model’s fit and to (slightly) expand the capacity of glm fits to incorporate non-linear terms and multiple imputation.

These are by no means the only options for fitting or working with models.
What is tidymodels?

The tidymodels collection of packages in R uses tidyverse principles to facilitate modeling and machine learning work. The key idea is to develop a consistent framework for modeling.
Visit the tidymodels website at https://www.tidymodels.org/.
How do I get the tidymodels framework?

Install many of the packages in the tidymodels ecosystem with install.packages("tidymodels").
When you use library(tidymodels), this makes the core packages available in your R session. They include:
- rsample, which will help with data splitting and resampling
- parsnip, which provides a tidy, unified interface for models
- recipes, for data pre-processing and feature engineering
- yardstick, for measuring model effectiveness
- broom, for converting R objects into predictable formats
- workflows, for bundling together pre-processing, modeling and post-processing work

as well as dials and tune, which help manage and optimize tuning parameters in certain types of models.
Heart and Estrogen/Progestin Study (HERS)
hers_raw <- read_csv("c23/data/hersdata.csv", show_col_types = FALSE) |>
clean_names()
hers_new <- hers_raw |>
filter(diabetes == "no") |>
filter(complete.cases(ldl1, ldl)) |>
select(subject, ldl1, ldl, age, ht, globrat) |>
mutate(ht = factor(ht), globrat = factor(globrat),
subject = as.character(subject))

Predict percentage change in ldl from baseline to follow-up, using baseline age, ht, ldl and globrat, restricted to women without diabetes.
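The outcome ldl_pch isn’t created in the code above, although it is modeled later. Presumably it is the percentage change in LDL from baseline (ldl) to the first annual visit (ldl1); a sketch (the variable name and formula are assumptions, chosen to be consistent with the summaries shown later):

```r
# Assumed construction of the outcome: percentage change in LDL
# from baseline (ldl) to the first annual study visit (ldl1)
hers_new <- hers_new |>
  mutate(ldl_pch = 100 * (ldl1 - ldl) / ldl)
```

The mostly negative summaries shown later (median near −8.9) would then reflect LDL declining from baseline for most women.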
hers_new Codebook (n = 1925)

| Variable | Description |
|---|---|
| subject | subject code |
| ht | factor: hormone therapy or placebo |
| ldl | baseline LDL cholesterol in mg/dl |
| age | baseline age in years |
| globrat | baseline self-reported health (5 levels) |
| ldl1 | LDL at first annual study visit |
| diabetes | yes or no (all are no in our sample) |
hers_new tibble

# A tibble: 1,925 × 6
subject ldl1 ldl age ht globrat
<chr> <dbl> <dbl> <dbl> <fct> <fct>
1 1 138. 122. 70 placebo good
2 2 151. 242. 62 placebo good
3 4 123. 116. 64 placebo good
4 5 172. 151. 65 placebo good
5 6 127. 138. 68 hormone therapy good
6 8 113. 121. 69 hormone therapy very good
7 9 106. 133 61 hormone therapy very good
8 10 173. 220 62 hormone therapy good
9 11 192 173. 72 placebo fair
10 12 149. 124. 73 hormone therapy good
# ℹ 1,915 more rows
Key Reference: Kuhn and Silge, Tidy Modeling with R (TMWR)
Summary of the outcome, ldl_pch:

| min | Q1 | median | Q3 | max | mean | sd | n | missing |
|---|---|---|---|---|---|---|---|---|
| −80.9 | −21.0 | −8.9 | 5.6 | 137.4 | −6.5 | 22.8 | 1,925.0 | 0.0 |
Here, we’ll use the rsample package to split our data.
We start with 1925 women in hers_new, which we split into 1540 women in the training sample, leaving 385 women in the testing sample.
How do we split the data with rsample?

initial_split(hers_new, prop = 0.8, strata = ht)
- initial_time_split() identifies the first part of the data as the training set and the rest as the test set; this assumes the data were pre-sorted in a sensible order.

The test set should always resemble new data we might give to the model.
A test set should be avoided only when the data are pathologically small.
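The 1540/385 split described earlier could be produced with rsample code along these lines (a sketch; the seed and the object names hers_split, hers_train and hers_test are assumptions, though a test sample is used later in these notes):

```r
library(rsample)

set.seed(432)  # assumed seed, for reproducibility

# 80% training / 20% testing, stratified on the ht factor
hers_split <- initial_split(hers_new, prop = 0.8, strata = ht)
hers_train <- training(hers_split)  # 1540 women
hers_test  <- testing(hers_split)   #  385 women
```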
A recipe for pre-modeling work might include any of the steps described below.
https://www.tidymodels.org/find/recipes/ lists all recipes
hers_rec <- recipe(ldl_pch ~ age + ht + ldl + globrat, data = hers_new)
Anything to the left of the ~ is an outcome. Anything to the right of the ~ is a predictor. Sometimes we want to assign other roles, like “id” for an important identifier, or “split” for a splitting variable.
The add_role(), remove_role() and update_role() functions are helpful.

- step_log(x1, base = 10) (the default base is exp(1)), step_sqrt(), step_inverse()
- step_BoxCox() transforms predictors using a simple Box-Cox transformation to make them more symmetric (remember this requires a strictly positive variable, and is something we’d use more for an outcome, working with the residuals from a statistical model).
- step_YeoJohnson() uses the Yeo-Johnson transformation (again, typically for the outcome), which is like Box-Cox but doesn’t require the input variables to be strictly positive.
- step_logit() and step_invlogit()

https://www.tidymodels.org/find/recipes/ lists available recipes.
- step_poly() produces orthogonal polynomial basis functions
- step_ns(x5, deg_free = 10) from the splines package can create natural splines; the number of spline terms is a tuning parameter. step_bs() adds B-spline basis functions.
- step_dummy(all_nominal()) converts all factor or categorical variables into indicator (also called dummy) variables: numeric variables which take 1 and 0 as values to encode the categorical information
Useful selectors include all_numeric(), all_predictors() and all_outcomes(); you can also name variables directly, as in step_dummy(x2, x3).

- step_relevel() reorders the provided factor columns so that a level you specify is first (the baseline)
- step_unorder() converts to regular factors, and step_ordinalscore() maps specific numeric values to each factor level
- step_unknown() changes missing values in a categorical variable to a dedicated factor level
- step_novel() creates a new factor level that may be encountered in future data
- step_other() converts infrequent values to a catch-all level labeled “other”, using a threshold
- step_other(x5, threshold = 0.05) places the bottom 5% of values of x5 into “other”.
- step_interact(~ interaction terms) can be used to set up interactions
- step_filter() can be used to filter rows using dplyr tools
- step_mutate() can be used to conduct a variety of basic operations
- step_ratio() can be used to create ratios of current variables
- step_zv() is the zero-variance filter, which removes variables that contain only a single value.
- step_nzv() removes variables with very few unique values, or for which the ratio of the frequency of the most common value to that of the second most common value is large
- step_normalize() centers and scales quantitative predictors
- step_center() just centers predictors, step_scale() just scales numeric data, and step_range() scales numeric data to a specific range
- step_meanimpute() and step_medianimpute() impute with the mean or median
- step_modeimpute() imputes nominal data using the most common value
- step_bagimpute() imputes via bagged trees
- step_knnimpute() imputes via k-nearest neighbors
- step_naomit() can be used to remove observations with missing values

Other available engines for linear regression include:
- stan to fit Bayesian models, spark, and keras

All parsnip engines: https://www.tidymodels.org/find/parsnip/
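With parsnip, the model type and the computational engine are specified separately, so switching engines is a one-line change. A minimal sketch (the object names are mine, for illustration):

```r
library(parsnip)

# Ordinary least squares via stats::lm()
lm_model <- linear_reg() |>
  set_engine("lm")

# Same model type, different computational engine (rstanarm)
stan_model <- linear_reg() |>
  set_engine("stan")
```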
The stan modeling engine for fit2

As an alternative, we’ll often consider a Bayesian linear regression model as fit with the “stan” engine. This requires the pre-specification of a prior distribution for the coefficients, for instance:
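One way to pass priors through is via set_engine(), which hands extra arguments to rstanarm::stan_glm(). A sketch (these particular normal priors are illustrative assumptions, not the priors used in fit2):

```r
library(parsnip)
library(rstanarm)  # supplies the prior specification functions

# Illustrative prior: normal with mean 0 and scale 10 (an assumption)
prior_dist <- rstanarm::normal(location = 0, scale = 10)

stan_model <- linear_reg() |>
  set_engine("stan",
             prior_intercept = prior_dist,  # prior for the intercept
             prior = prior_dist)            # prior for the coefficients
```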
The lm model

Fitting the lm model to the training sample

fit1 results

══ Workflow [trained] ══════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: linear_reg()
── Preprocessor ────────────────────────────────────────────────────────────────
4 Recipe Steps
• step_impute_bag()
• step_poly()
• step_dummy()
• step_normalize()
── Model ───────────────────────────────────────────────────────────────────────
Call:
stats::lm(formula = ..y ~ ., data = data)
Coefficients:
(Intercept) age ldl_poly_1 ldl_poly_2
-6.3678 -0.7718 -7.8566 2.2361
ht_placebo globrat_fair globrat_good globrat_poor
4.8928 -0.7234 -1.5689 -1.1226
globrat_very.good
-1.3897
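The trained workflow printed above could have been assembled with code along these lines (a sketch: the step arguments, especially which variables receive bagged-tree imputation and polynomial terms, and the use of the training sample, are assumptions consistent with the printed recipe steps and coefficient names):

```r
library(tidymodels)

# Recipe: four steps matching the printed workflow
hers_rec <- recipe(ldl_pch ~ age + ht + ldl + globrat, data = hers_train) |>
  step_impute_bag(globrat) |>              # assumed imputation target
  step_poly(ldl, degree = 2) |>            # yields ldl_poly_1, ldl_poly_2
  step_dummy(all_nominal_predictors()) |>  # ht_placebo, globrat_* indicators
  step_normalize(all_numeric_predictors())

# Bundle the recipe with the lm engine, then fit to the training sample
fit1 <- workflow() |>
  add_recipe(hers_rec) |>
  add_model(linear_reg() |> set_engine("lm")) |>
  fit(data = hers_train)
```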
What’s in fit1?

We know about tidy(), glance() and augment(), of course.
fit1 coefficients

tidy(fit1, conf.int = TRUE) |>
  select(term, estimate, std.error, conf.low, conf.high) |>
  gt() |> fmt_number(decimals = 3) |> tab_options(table.font.size = 20)

| term | estimate | std.error | conf.low | conf.high |
|---|---|---|---|---|
| (Intercept) | −6.368 | 0.523 | −7.394 | −5.342 |
| age | −0.772 | 0.525 | −1.802 | 0.258 |
| ldl_poly_1 | −7.857 | 0.524 | −8.884 | −6.829 |
| ldl_poly_2 | 2.236 | 0.524 | 1.208 | 3.265 |
| ht_placebo | 4.893 | 0.523 | 3.866 | 5.920 |
| globrat_fair | −0.723 | 1.002 | −2.689 | 1.242 |
| globrat_good | −1.569 | 1.196 | −3.915 | 0.777 |
| globrat_poor | −1.123 | 0.572 | −2.244 | −0.001 |
| globrat_very.good | −1.390 | 1.114 | −3.574 | 0.795 |
Other fit1 summaries?

The stan model

Fitting the stan model to the training sample

fit2 results

══ Workflow [trained] ══════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: linear_reg()
── Preprocessor ────────────────────────────────────────────────────────────────
4 Recipe Steps
• step_impute_bag()
• step_poly()
• step_dummy()
• step_normalize()
── Model ───────────────────────────────────────────────────────────────────────
stan_glm
family: gaussian [identity]
formula: ..y ~ .
observations: 1540
predictors: 9
------
Median MAD_SD
(Intercept) -6.3 0.5
age -0.7 0.5
ldl_poly_1 -7.8 0.6
ldl_poly_2 2.1 0.5
ht_placebo 4.8 0.5
globrat_fair -0.2 0.8
globrat_good -0.9 0.9
globrat_poor -0.9 0.5
globrat_very.good -0.8 0.8
Auxiliary parameter(s):
Median MAD_SD
sigma 20.5 0.4
------
* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg
fit2 coefficients?

The stan model requires the broom.mixed package to tidy the fit.
broom.mixed::tidy(fit2, conf.int = TRUE) |> gt() |>
  fmt_number(decimals = 3) |> tab_options(table.font.size = 20)

| term | estimate | std.error | conf.low | conf.high |
|---|---|---|---|---|
| (Intercept) | −6.286 | 0.516 | −7.149 | −5.447 |
| age | −0.725 | 0.497 | −1.536 | 0.092 |
| ldl_poly_1 | −7.787 | 0.550 | −8.724 | −6.898 |
| ldl_poly_2 | 2.145 | 0.519 | 1.313 | 3.011 |
| ht_placebo | 4.801 | 0.495 | 3.975 | 5.649 |
| globrat_fair | −0.250 | 0.783 | −1.541 | 1.023 |
| globrat_good | −0.929 | 0.918 | −2.486 | 0.566 |
| globrat_poor | −0.917 | 0.531 | −1.825 | −0.054 |
| globrat_very.good | −0.809 | 0.847 | −2.279 | 0.546 |
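The plot code below uses a tibble called coefs_comp that isn’t built in these slides. Presumably it stacks the tidied coefficients from both fits, along these lines (the object and column names mirror the plot code; the construction itself is an assumption):

```r
# Assumed construction of coefs_comp: combine tidied coefficients
# from the lm fit (fit1) and the stan fit (fit2), labeled by model
coefs_comp <- bind_rows(
  tidy(fit1, conf.int = TRUE) |> mutate(mod = "lm"),
  broom.mixed::tidy(fit2, conf.int = TRUE) |> mutate(mod = "stan")
) |>
  select(mod, term, estimate, conf.low, conf.high)
```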
ggplot(coefs_comp, aes(x = term, y = estimate, col = mod,
ymin = conf.low, ymax = conf.high)) +
geom_point(position = position_dodge2(width = 0.4)) +
geom_pointrange(position = position_dodge2(width = 0.4)) +
geom_hline(yintercept = 0, lty = "dashed") +
coord_flip() +
labs(x = "", y = "Estimate (with 95% confidence interval)",
     title = "Comparing the lm and stan model coefficients")

Available regression performance metrics include:
- rsq (R-squared, via correlation; always between 0 and 1)
- rmse (root mean squared error)
- mae (mean absolute error)
- rsq_trad (R-squared, calculated via sums of squares)

Assessing fit1 and fit2 in the training sample

We’ll see the results from each fit on the next slide.
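The metric tables that follow (and the mets() calls applied to the test sample later) were presumably produced with a yardstick metric set, something like this sketch (the mets name matches the later code; the prediction-object names are assumptions):

```r
library(tidymodels)

# Assumed definition of the mets() metric set used later
mets <- metric_set(rsq, rmse)

# Training-sample assessment sketch for fit1
fit1_pred_train <- predict(fit1, hers_train) |>
  bind_cols(hers_train |> select(ldl_pch))

mets(fit1_pred_train, truth = ldl_pch, estimate = .pred)
```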
Metrics for fit1 and fit2 (training sample)

| .metric | .estimator | .estimate |
|---|---|---|
| rsq | standard | 0.1797 |
| rmse | standard | 20.4622 |
| .metric | .estimator | .estimate |
|---|---|---|
| rsq | standard | 0.1795 |
| rmse | standard | 20.4650 |
The yardstick package doesn’t compute adjusted \(R^2\). tidymodels wants you to compute performance on a separate data set when comparing models, rather than doing what adjusted \(R^2\) tries to do, which is to evaluate the model on the same data that were used to fit it.

I wrote more about the adjusted \(R^2\) statistic in the Class 23 README.
lm_pred_test <-
predict(fit1, hers_test) |>
bind_cols(hers_test |> dplyr::select(ldl_pch))
lm_res_test <-
mets(lm_pred_test, truth = ldl_pch, estimate = .pred)
stan_pred_test <-
predict(fit2, hers_test) |>
bind_cols(hers_test |> select(ldl_pch))
stan_res_test <-
  mets(stan_pred_test, truth = ldl_pch, estimate = .pred)

Metrics for fit1 and fit2 (test sample)

| .metric | .estimator | .estimate |
|---|---|---|
| rsq | standard | 0.1998 |
| rmse | standard | 21.0827 |
| .metric | .estimator | .estimate |
|---|---|---|
| rsq | standard | 0.1999 |
| rmse | standard | 21.0856 |
This is just a small sample of what tidymodels can accomplish.
We’ll apply ideas from the tidymodels framework to fit a logistic regression model.
Packages beyond tidymodels

These are specific to the Bayesian material I’ll discuss here.
See Chapter 33 of the Course Notes.
We’ll use the hers_new tibble in a complete case analysis, and fit a model to predict ldl_pch using age, ht, ldl and globrat.
| min | Q1 | median | Q3 | max | mean | sd | n | missing |
|---|---|---|---|---|---|---|---|---|
| −80.91 | −21.04 | −8.87 | 5.58 | 137.36 | −6.47 | 22.79 | 1925 | 0 |
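The stan_glm output shown next could come from fitting the model directly with rstanarm, outside tidymodels, along these lines (the object name fit3 is an assumption; rstanarm’s default weakly informative, auto-scaled priors would be consistent with the prior summary table at the end):

```r
library(rstanarm)

# Complete-case Bayesian linear model on the natural predictor scale;
# stan_glm drops the 2 rows with missing values, leaving n = 1923
fit3 <- stan_glm(ldl_pch ~ age + ht + ldl + globrat,
                 data = hers_new,
                 family = gaussian(),
                 refresh = 0)   # suppress sampling progress output

print(fit3)     # Median / MAD_SD table
summary(fit3)   # estimates plus MCMC diagnostics
```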
stan_glm
family: gaussian [identity]
formula: ldl_pch ~ age + ht + ldl + globrat
observations: 1923
predictors: 8
------
Median MAD_SD
(Intercept) 33.9 5.5
age -0.2 0.1
htplacebo 10.4 0.9
ldl -0.2 0.0
globratfair -2.0 2.3
globratgood -2.9 2.2
globratpoor -9.1 5.1
globratvery good -2.8 2.2
Auxiliary parameter(s):
Median MAD_SD
sigma 20.8 0.3
------
* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg
Model Info:
function: stan_glm
family: gaussian [identity]
formula: ldl_pch ~ age + ht + ldl + globrat
algorithm: sampling
sample: 4000 (posterior sample size)
priors: see help('prior_summary')
observations: 1923
predictors: 8
Estimates:
mean sd 10% 50% 90%
(Intercept) 33.9 5.5 27.1 33.9 40.8
age -0.2 0.1 -0.3 -0.2 -0.1
htplacebo 10.4 0.9 9.2 10.4 11.5
ldl -0.2 0.0 -0.2 -0.2 -0.2
globratfair -2.0 2.4 -5.0 -2.0 1.0
globratgood -3.0 2.2 -5.8 -2.9 -0.2
globratpoor -9.1 5.0 -15.6 -9.1 -2.7
globratvery good -2.8 2.3 -5.7 -2.8 0.2
sigma 20.8 0.3 20.3 20.8 21.2
Fit Diagnostics:
mean sd 10% 50% 90%
mean_PPD -6.5 0.7 -7.3 -6.5 -5.6
The mean_ppd is the sample average posterior predictive distribution of the outcome variable (for details see help('summary.stanreg')).
MCMC diagnostics
mcse Rhat n_eff
(Intercept) 0.1 1.0 3945
age 0.0 1.0 4597
htplacebo 0.0 1.0 5597
ldl 0.0 1.0 5743
globratfair 0.0 1.0 2310
globratgood 0.0 1.0 2198
globratpoor 0.1 1.0 3996
globratvery good 0.0 1.0 2192
sigma 0.0 1.0 5639
mean_PPD 0.0 1.0 4735
log-posterior 0.1 1.0 1560
For each parameter, mcse is Monte Carlo standard error, n_eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence Rhat=1).
[1] 4000
(Intercept) age htplacebo ldl globratfair globratgood
1 29.04810 -0.09509692 10.871120 -0.2089645 -2.23579086 -3.7337133
2 27.85075 -0.16064074 9.609842 -0.1922661 -0.48760101 -0.9138344
3 29.60465 -0.19588382 8.771109 -0.1912752 1.26874538 0.4195834
4 30.73136 -0.20668247 8.893424 -0.1942097 -0.07013866 0.7051094
5 30.04884 -0.13862132 12.536751 -0.2019070 -2.65790246 -4.1160047
6 31.86282 -0.15387960 12.136498 -0.2080974 -4.56255686 -3.6733781
globratpoor globratvery good
1 -9.365345 -3.3901981
2 -3.823223 -0.5027938
3 -5.843063 -0.5275769
4 -5.919552 0.4881570
5 -3.420240 -5.1095763
6 -4.769105 -4.8666684
[Plots of posterior distributions: age effect, ht effect, ldl effect, globrat effect (fair vs. excellent)]

Summary of Posterior Distribution
| Parameter | Median | 95% CI | pd | ROPE | % in ROPE | Rhat | ESS |
|---|---|---|---|---|---|---|---|
| (Intercept) | 33.94 | [ 23.22, 44.57] | 100% | [-2.28, 2.28] | 0% | 0.999 | 3945.00 |
| age | -0.18 | [ -0.31, -0.05] | 99.50% | [-2.28, 2.28] | 100% | 0.999 | 4597.00 |
| htplacebo | 10.38 | [ 8.52, 12.27] | 100% | [-2.28, 2.28] | 0% | 0.999 | 5597.00 |
| ldl | -0.21 | [ -0.24, -0.19] | 100% | [-2.28, 2.28] | 100% | 0.999 | 5743.00 |
| globratfair | -2.00 | [ -6.56, 2.72] | 80.85% | [-2.28, 2.28] | 53.84% | 1.000 | 2310.00 |
| globratgood | -2.92 | [ -7.29, 1.38] | 91.17% | [-2.28, 2.28] | 37.97% | 0.999 | 2198.00 |
| globratpoor | -9.12 | [-19.07, 0.48] | 96.60% | [-2.28, 2.28] | 6.45% | 1.000 | 3996.00 |
| globratvery good | -2.77 | [ -7.17, 1.69] | 88.90% | [-2.28, 2.28] | 41.79% | 1.000 | 2192.00 |
| Parameter | Prior_Distribution | Prior_Location | Prior_Scale |
|---|---|---|---|
| (Intercept) | normal | -6.46 | 57.00 |
| age | normal | 0.00 | 8.43 |
| htplacebo | normal | 0.00 | 113.99 |
| ldl | normal | 0.00 | 1.53 |
| globratfair | normal | 0.00 | 148.51 |
| globratgood | normal | 0.00 | 114.05 |
| globratpoor | normal | 0.00 | 548.28 |
| globratvery good | normal | 0.00 | 127.30 |
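The two tables above resemble output from the bayestestR package; a sketch of calls that could produce them, assuming the rstanarm fit object is named fit3 (the name isn’t shown in these slides):

```r
library(bayestestR)

# Posterior summary: median, 95% CI, pd, ROPE coverage, Rhat, ESS
describe_posterior(fit3, ci = 0.95)

# The priors actually used (rstanarm's auto-scaled defaults)
describe_prior(fit3)
```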
432 Class 23 | 2024-04-11 | https://thomaselove.github.io/432-2024/